bounceR: Automated Feature Selection for Machine Learning Algorithms

Lukas Jan Stroemsdoerfer
Data Scientist @STATWORX

About STATWORX

We are a consulting company for data science, machine learning and statistics with offices in Frankfurt, Zurich and Stuttgart. We support our customers in the development and implementation of data science and machine learning solutions.

Our clients:
Our expertise:

About Our Workflow

Data science projects often follow a similar structure. At the very beginning, one must load and prep the data, of course. Everything afterwards is fun; the first two parts are not.

About Feature Selection: Problem

About Feature Selection: Solutions?

Currently there are two main ways to select the relevant features out of the entire feature space:

About Our Idea

About Componentwise-Boosting

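Componentwise gradient boosting is a boosting ensemble algorithm that makes it possible to tell relevant features from irrelevant ones, because each boosting step updates only the single feature that best fits the current residuals. In essence, the method follows this algorithm: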
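A minimal base-R sketch of componentwise L2 boosting with one-feature linear base learners. It is an illustration under stated assumptions, not the code used inside the package: the function name `componentwise_boost` and the simulated data are made up for this example, and only `mstop` (number of steps) and `nu` (step size) mirror names that appear on these slides.

```r
# Componentwise L2 boosting with simple linear base learners (illustrative sketch)
componentwise_boost <- function(X, y, mstop = 100, nu = 0.1) {
  X <- scale(as.matrix(X))               # center and scale each feature
  offset <- mean(y)
  fit <- rep(offset, length(y))          # start from the mean of the target
  coefs <- setNames(numeric(ncol(X)), colnames(X))
  picks <- character(mstop)

  for (m in seq_len(mstop)) {
    u <- y - fit                         # negative gradient of the squared-error loss
    # least-squares slope of the residuals on each single feature
    betas <- as.vector(crossprod(X, u)) / colSums(X^2)
    # residual sum of squares of each one-feature base learner
    rss <- vapply(seq_len(ncol(X)),
                  function(j) sum((u - X[, j] * betas[j])^2),
                  numeric(1))
    best <- which.min(rss)               # only the best-fitting feature is updated
    coefs[best] <- coefs[best] + nu * betas[best]
    fit <- fit + nu * betas[best] * X[, best]
    picks[m] <- colnames(X)[best]
  }
  list(coefficients = coefs, offset = offset, selected = table(picks))
}

# Simulated data (assumption): only x1 and x2 carry signal
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 10), n, 10, dimnames = list(NULL, paste0("x", 1:10)))
y <- 2 * X[, "x1"] - 1.5 * X[, "x2"] + rnorm(n)

res <- componentwise_boost(X, y)
round(res$coefficients, 2)   # coefficients on the scaled features
res$selected                 # how often each feature was picked
```

Features that are never or rarely picked keep a coefficient at or near zero, which is what makes the method usable for feature selection.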

About Our Algorithm: Goal

Find a feature selection algorithm that can distinguish relevant from irrelevant features without overfitting the training data.

About Our Algorithm: Idea

In each round, a random stability score distribution is initialized. Over the course of \( m \) models, the distribution is adjusted. Essentially, our code follows the algorithm:

About Our Algorithm: Procedure

Essentially, we take bits from cool algorithms and put them together. For one, we leverage the complete randomness of random forests. Additionally, we apply a somewhat transformed idea of backpropagation.
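A hypothetical sketch of how these pieces could fit together in a single round. This is our reading of the idea for illustration, not the package's actual implementation: the helper names `boost_picks` and `score_features`, the exact update rule, and the simulated data are all assumptions. Each model sees a bootstrap sample and a random feature subset (the random-forest part), and the scores are pushed up or down depending on what a short componentwise boosting run picks (the feedback part).

```r
# Hypothetical sketch: NOT the bounceR implementation, only the general idea.

# Short componentwise L2 boosting run; returns the features picked at least once
boost_picks <- function(X, y, mstop = 10, nu = 0.1) {
  X <- scale(X)
  fit <- rep(mean(y), length(y))
  picked <- character(0)
  for (i in seq_len(mstop)) {
    u <- y - fit
    betas <- as.vector(crossprod(X, u)) / colSums(X^2)
    # for centered features, feature j reduces the RSS by betas[j]^2 * sum(X[, j]^2)
    best <- which.max(betas^2 * colSums(X^2))
    fit <- fit + nu * betas[best] * X[, best]
    picked <- union(picked, colnames(X)[best])
  }
  picked
}

# One round: random initial scores, adjusted over n_mods models
score_features <- function(X, y, n_mods = 200, p = 4,
                           reward = 0.2, penalty = 0.3) {
  scores <- setNames(runif(ncol(X)), colnames(X))   # random initial scores
  for (m in seq_len(n_mods)) {
    rows  <- sample(nrow(X), replace = TRUE)        # bootstrap the observations
    feats <- sample(colnames(X), p)                 # random feature subset
    picked <- boost_picks(X[rows, feats, drop = FALSE], y[rows])
    missed <- setdiff(feats, picked)
    scores[picked] <- scores[picked] + reward       # reward selected features
    scores[missed] <- scores[missed] - penalty      # penalize ignored features
  }
  sort(scores, decreasing = TRUE)
}

# Simulated data (assumption): only x1 and x2 carry signal
set.seed(42)
n <- 300
X <- matrix(rnorm(n * 8), n, 8, dimnames = list(NULL, paste0("x", 1:8)))
y <- 3 * X[, "x1"] + 2 * X[, "x2"] + rnorm(n)

scores <- score_features(X, y)
round(scores, 1)
```

Features that keep getting picked across many random subsets drift upwards, while features that are only picked when nothing better is available drift downwards.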

About Our Algorithm: Usage

Sure, you have a lot of tuning parameters; however, we put them all together in one handy little interface. By the way, we set the defaults based on several simulation studies, so you can (sort of) trust them, sometimes.

# Feature Selection using bounceR-----------------------------------------------------
selection <- featureSelection(data = train_df,
                              target = "target",
                              index = NULL,
                              selection = selectionControl(n_rounds = 100,
                                                           n_mods = 1000,
                                                           p = NULL,
                                                           reward = 0.2,
                                                           penalty = 0.3,
                                                           max_features = NULL),
                              bootstrap = "regular",
                              boosting = boostingControl(mstop = 100, nu = 0.1),
                              early_stopping = "aic",
                              n_cores = 6)

About Our Package: Installation

The package is still under development and not yet listed on CRAN. However, you can get it from GitHub.

# load devtools
install.packages("devtools")
library(devtools)

# download from our public repo
devtools::install_github("STATWORX/bounceR")

# load it
library(bounceR)

If you find any bugs or spot anything that is not super convenient, just open an issue.

About Our Package: Content

The package contains a variety of useful functions surrounding the topic of feature selection, such as:

  • Convenience:
    • sim_data: a function simulating regression and classification data, where the true feature space is known
  • Filtering:
    • featureFiltering: a function implementing several popular filter methods for feature selection
  • Wrapper:
    • featureSelection: a function implementing our home grown algorithm for feature selection
  • Methods:
    • print.sel_obj: an S4 printing method for the object class “sel_obj”
    • plot.sel_obj: an S4 plotting method for the object class “sel_obj”
    • summary.sel_obj: an S4 summary method for the object class “sel_obj”
    • builder: method to extract a formula with n features from a “sel_obj”
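To make the idea behind the formula builder concrete, here is a base-R analogue. It does not call the package: the named score vector stands in for a fitted “sel_obj”, and the helper `top_n_formula` as well as the score values are hypothetical, made up purely for illustration.

```r
# Hypothetical stand-in for a selection result: a named vector of feature scores
scores <- c(x1 = 20.3, x2 = 18.9, x3 = -4.2, x4 = -15.0)

# Build a model formula from the n best-scoring features (base R only)
top_n_formula <- function(scores, target, n) {
  top <- names(sort(scores, decreasing = TRUE))[seq_len(min(n, length(scores)))]
  reformulate(termlabels = top, response = target)
}

f <- top_n_formula(scores, target = "target", n = 2)
f
```

The resulting formula can be passed straight to a modelling function such as `lm()`, provided the data frame contains the target and the selected columns.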

About The End

If you have any questions, are interested or have an idea, just contact us!